RDD Using Text File

The textFile() method reads a text file from HDFS, the local file system, or any Hadoop-supported file system. Spark Core provides the textFile() and wholeTextFiles() methods in the SparkContext class, which are used to read single or multiple text or CSV files into a single Spark RDD. These methods can also read all files in a directory, or only files matching a specific pattern.

textFile(): Reads single or multiple text or CSV files and returns a single RDD[String].

wholeTextFiles(): Reads single or multiple files and returns a single RDD[Tuple2[String, String]], where the first value (_1) in each tuple is the file name and the second value (_2) is the content of that file.
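As a sketch of the tuple structure wholeTextFiles() returns (the directory path below is a placeholder; point it at files that exist in your environment):

```scala
// Read every file under the directory; each element is a
// (fileName, fileContent) pair rather than one line per element.
val filesRdd = sc.wholeTextFiles("/FileStore/tables/")

filesRdd.collect.foreach { case (name, content) =>
  println(s"File: $name")
  println(s"First 100 chars: ${content.take(100)}")
}
```

This is useful when per-file structure matters (for example, one JSON or XML document per file), since textFile() would split each file into individual lines.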

Read text File from HDFS
val rdd = sc.textFile("/FileStore/tables/orders.txt")
rdd.collect.foreach(f=>{println(f)})

Read text File from Local file System
val rdd = sc.textFile("file:///home/hduser/Desktop/Data/data.txt")
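textFile() also accepts an optional minPartitions argument to control the minimum parallelism of the resulting RDD. A minimal sketch, assuming the same local file as above:

```scala
// Request at least 4 partitions when reading; Spark may create more
// depending on the input splits, but not fewer.
val rdd = sc.textFile("file:///home/hduser/Desktop/Data/data.txt", 4)
println(rdd.getNumPartitions)
```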
Read All Text Files from a Directory
Pass a directory path or a wildcard pattern to textFile() to read every matching file into a single RDD[String]. If instead you want the entire contents of each file as a single record, use the wholeTextFiles() method on SparkContext.

val rdd = sc.textFile("/FileStore/tables/*")
rdd.collect.foreach(f=>{println(f)})


RDD From File
val scalaFile = scala.io.Source.fromFile("/data/retail_db/products/part-00000").getLines.toList
val scalaFileRDD = sc.parallelize(scalaFile)
 
Word Count Program
val rdd = sc.textFile("file:///home/hduser/Desktop/Data/data.txt")
val words = rdd.flatMap(x => x.split(" ")).map(x => (x,1))
val word_count = words.reduceByKey((x, y) => x + y)
word_count.collect()
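To view the most frequent words first, the counts can be sorted by value before collecting. A minimal sketch continuing from word_count above:

```scala
// Sort the (word, count) pairs by count in descending order
// and print the top 10 words.
val topWords = word_count.sortBy({ case (_, count) => count }, ascending = false)
topWords.take(10).foreach { case (word, count) => println(s"$word: $count") }
```

Note that sortBy triggers a shuffle, so on large data it is cheaper to use take() on the sorted RDD than to collect() everything to the driver.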
